AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which customer segments to target.
ID: Customer ID
Age: Customer's age in completed years
Experience: Number of years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Problem Definition:
AllLife Bank aims to enhance its loan business by converting liability customers into personal loan customers while retaining them as depositors.
As the data scientist (Sujit) on this project, the objective is to develop a predictive model that identifies customers likely to purchase a personal loan and to understand the key attributes driving purchases, giving the marketing department actionable insights. A successful model will sharpen target-marketing strategies, raise the personal loan conversion rate, and pinpoint specific customer segments for targeted campaigns.
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
## To get confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
## Using read_csv from the pandas library to read the data
data = pd.read_csv("C:/Users/sthakur4/OneDrive - Biogen/Documents/PGP MLAI/Machine Learning/project/Loan_Modelling.csv")
## Making a copy of the data to keep the original intact, using .copy()
df = data.copy()
## Let's check the first 5 rows of the data using .head()
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
## Let's take a look at the last 5 rows using .tail()
df.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
## Let's check the shape of the data
df.shape
(5000, 14)
## Let's check the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
## Let's look at a statistical summary of the dataset using .describe()
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
We can see from the above summary that:
A) ID:
ID is just a row identifier (1 to 5000); it carries no predictive information and can be dropped before modeling.
B) Age:
Customers in the dataset have an average age of approximately 45 years, with a minimum age of 23 and a maximum age of 67.
The age distribution is nearly symmetrical, with the median and mean close in value: mean - median = 45.34 - 45 = 0.34, so it is only very slightly right-skewed.
C) Experience:
The average professional experience of customers is around 20 years, with values ranging from -3 to 43 years, indicating a diverse range of experience levels.
A minimum of -3 years is clearly a data error (experience cannot be negative), so this will be cleaned during preprocessing.
D) Income:
The average annual income is 73.77 thousand dollars, with a minimum of 8 and a maximum of 224 thousand dollars.
The income distribution is positively skewed, as the mean is noticeably higher than the median (64).
E) ZIPCode:
ZIP codes range from 90005 to 96651; they are location labels, not quantities, so they should be treated as categories (e.g., grouped by their leading digits) rather than numbers.
F) Family:
The average number of family members is approximately 2.4, with a range from 1 to 4.
The distribution is slightly positively skewed, with more customers having smaller families.
G) CCAvg:
The average spending on credit cards per month is 1.94 thousand dollars, ranging from 0 to 10 thousand dollars.
The distribution is positively skewed, indicating that most customers have lower credit card spending.
H) Education:
Education is a discrete categorical variable with three levels; Undergrad (level 1) is the most common, followed by Advanced/Professional (level 3) and Graduate (level 2).
I) Mortgage:
The median mortgage is 0, i.e., most customers have no mortgage, while the maximum is 635 thousand dollars; the distribution is heavily right-skewed.
J) Personal Loan:
Only 9.6% of customers accepted the personal loan in the last campaign, so the target classes are quite imbalanced.
K) Securities Account, CD Account, Online, Credit Card:
These are binary flags: 10.4% of customers hold a securities account, 6.0% hold a CD account, 59.7% use online banking, and 29.4% have a credit card from another bank.
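The skewness claims above (mean pulled away from the median by a long tail) can be checked numerically with pandas' built-in `.skew()`. A minimal sketch on an illustrative income-like series (the values are made up, not taken from the dataset):

```python
import pandas as pd

# Illustrative series: most values are small, a few are very large
income = pd.Series([8, 20, 39, 64, 64, 98, 224])

mean, median = income.mean(), income.median()
skew = income.skew()  # sample skewness (Fisher-Pearson, bias-corrected)

# The long right tail pulls the mean above the median and makes skew positive
print(mean > median, skew > 0)  # -> True True
```

The same one-liner (`df.skew(numeric_only=True)`) would report skewness for every numeric column at once.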
## Let's check whether there are any duplicate rows
df.duplicated().sum()
0
We can see from the above that there are no duplicate rows in our data.
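Duplicate and missing-value checks follow the same counting pattern; a small sketch on a synthetic frame (`toy` is a made-up stand-in for the bank data):

```python
import pandas as pd

# Small synthetic frame standing in for the bank data (illustrative only)
toy = pd.DataFrame({
    "Age": [25, 45, 45, 25],
    "Income": [49, 34, 34, 49],
})

# duplicated() flags rows identical to an earlier row; sum() counts them
n_dupes = toy.duplicated().sum()

# isnull().sum() gives per-column missing-value counts
missing = toy.isnull().sum()

print(n_dupes)        # rows 2 and 3 repeat rows 1 and 0 -> 2
print(missing.sum())  # no NaNs in this toy frame -> 0
```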
Questions:
## Before answering the questions and performing univariate and bivariate analysis, let's define all plotting functions
## Function for plotting histograms and boxplots of a predictor w.r.t the target variable
def dist_with_target_plots (data , predictor , target): ## initializing function
fig, axs = plt.subplots(2, 2, figsize=(12, 10)) ## figure size for proper visibility
target_uniq = data[target].unique() ## getting unique value from target variable
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0])) ## setting title for first histogram
sns.histplot( ## Plotting histogram
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0,1].set_title("Distribution of target for target=" + str(target_uniq[1])) ## setting title for second histogram
sns.histplot(
data = data[data[target]==target_uniq[1]], ## Plotting histogram
x= predictor,
kde=True,
ax =axs[0,1],
color="teal",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target") ## Plotting first boxplot with outliers
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target") ## Plotting second boxplot without outliers
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
## Function for plotting the proportion of loan purchases by age group (the age bins are hard-coded)
def dist_with_target2(data, predictor, target): ## initializing function
# Create a new DataFrame with the selected columns
dt = data[[predictor, target]].copy()
# Create age groups
bins = [20, 25, 30, 35, 40, 45, 50, 55, 60 , 65]
labels = ['20-25','25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-60','60-65']
dt[predictor] = pd.cut(dt[predictor], bins=bins, labels=labels, right=False)
# Calculate the distribution of the target variable for each age group
dist_table = pd.crosstab(index=dt[predictor], columns=dt[target], normalize='index')
# Plot the distribution
plt.figure(figsize=(10, 6))
sns.barplot(x=dist_table.index, y=dist_table[1], color='skyblue', label='Bought Loan')
sns.barplot(x=dist_table.index, y=dist_table[0], color='lightcoral', bottom=dist_table[1], label="Didn't Buy Loan")
plt.title("Loan Purchase Distribution by Age Group") ## setting title
plt.xlabel("Age Group") ## changing x label
plt.ylabel("Proportion") ## changing y label
plt.legend(title=target) ## plotting legend
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None): ## initializing function
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot( ## plotting count plot
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
## Writing function for a plot which will plot histogram and box plot
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None): ## initializing function
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
## Writing a function for stacked barplot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique() ### unique from predictor
sorter = data[target].value_counts().index[-1] ### counting values of target
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values( ## making a cross tab of all values
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5)) ## plotting bar graph
plt.legend(loc="upper left", bbox_to_anchor=(1, 1)) ## placing legend outside the plot
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target): ## initializing function
fig, axs = plt.subplots(2, 2, figsize=(12, 10)) ## setting size for better visibility
target_uniq = data[target].unique() ## unique from target
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0])) ## setting title
sns.histplot(
data=data[data[target] == target_uniq[0]], ## plotting histogram
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor, ## plotting histogram
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow") ## box plot with outliers
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data, ## box plot without outliers
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6)) ## making figure size for better visibility
# Plot 1: histogram
sns.histplot(x="Mortgage", data = df, ax=axes[0],kde=True) ## plotting histogram with kde
axes[0].set_title('Histogram: Mortgage'); ## Setting title of plot
# Plot 2: Boxplot
sns.set_style("whitegrid"); ## changing grid style
sns.boxplot(data=df, x = "Mortgage", ax=axes[1]) ## plotting boxplot
axes[1].set_title('Boxplot: Mortgage'); ## Setting title of plot
Q1 = df["Mortgage"].quantile(0.25) # 25th percentile
Q3 = df["Mortgage"].quantile(0.75) # 75th percentile
IQR = Q3 - Q1 # Inter-quartile range (75th percentile - 25th percentile)
lower_bound = Q1 - 1.5 * IQR # Finding lower and upper bounds; all values outside these bounds are outliers
upper_bound = Q3 + 1.5 * IQR
outliers_count = ((df["Mortgage"] < lower_bound) | (df["Mortgage"] > upper_bound)).sum()
## getting count of outliers
percentage_outliers = (outliers_count / len(df)) * 100 ## calculating percentage of outliers
print(f"Percentage of outliers in Mortgage column is: {percentage_outliers:.2f}%")
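The IQR rule applied above can be illustrated end-to-end on a small synthetic series (the values below are illustrative only, not from the bank data):

```python
import pandas as pd

# Synthetic mortgage-like column: mostly zeros plus a few large values
mortgage = pd.Series([0, 0, 0, 0, 0, 90, 101, 120, 300, 635])

q1, q3 = mortgage.quantile(0.25), mortgage.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values outside the whiskers; mean() of a mask is a proportion
outliers = (mortgage < lower) | (mortgage > upper)
print(f"{outliers.mean() * 100:.1f}% flagged as outliers")  # -> 20.0%
```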
In the "CreditCard" column, a value of 1 means the customer holds a credit card issued by a bank other than AllLife Bank.
By that logic, a customer who shows some credit card spending (CCAvg > 0) but has 0 in the CreditCard column must be spending on an AllLife Bank credit card.
## Getting Answer for Part A
Customer_CreditCard = df[df["CreditCard"] == 1] ## making a subset of data where CreditCard is 1
otherBank = Customer_CreditCard["ID"].count()
print("Hence,", otherBank, "customers own a credit card from a bank other than AllLife Bank") ## Printing answer
Hence, 1470 customers own a credit card from a bank other than AllLife Bank
## Getting Answer for Part B
AlllifeCreditcard = df[(df["CreditCard"] == 0) & (df["CCAvg"] > 0)] ## making subset where CreditCard is 0 and CCAvg is greater than 0
All_life = AlllifeCreditcard["ID"].count()
print("Hence,", All_life, "customers own a credit card from AllLife Bank") ## Printing answer
Hence, 3452 customers own a credit card from AllLife Bank
## Getting Answer for Part C
All_cus = All_life + otherBank
print("All customers with a credit card: {}".format(All_cus))
All customers with a credit card: 4922
#### Customers with a credit card from a bank other than AllLife Bank: 1470
#### Customers with a credit card from AllLife Bank: 3452
#### Customers with a credit card from AllLife or any other bank: 4922
mat = df.corr() ## Calculating correlation of our data and storing it in variable mat
plt.figure(figsize=(12,14)); ## Increasing figure size for proper visibility
sns.heatmap(mat, annot=True, fmt=".2f", annot_kws={"size": 10}); ## plotting heat map of the same
plt.title("Correlation HeatMap");
dist_with_target_plots(df, 'Age', 'Personal_Loan') ### plotting distribution with target function plot of age and personalloan
dist_with_target2(df,'Age', 'Personal_Loan') ### Plotting proportional plot using the custom function for Age vs Personal_Loan
stacked_barplot(df, "Education", "Personal_Loan") ## plot stacked bar plot for Education and Personal_Loan
| Education | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 3 | 1296 | 205 | 1501 |
| 2 | 1221 | 182 | 1403 |
| 1 | 2003 | 93 | 2096 |
print("For Level 3 ,{}% of people purchased loan".format((205/1501)*100)) ## printing calculation
print("For Level 2 ,{}% of people purchased loan".format((182/1403)*100))
print("For Level 1 ,{}% of people purchased loan".format((93/2096)*100))
For Level 3 ,13.657561625582945% of people purchased loan For Level 2 ,12.972202423378477% of people purchased loan For Level 1 ,4.437022900763359% of people purchased loan
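The per-level conversion rates computed by hand above can also be read straight off a normalized crosstab; a sketch with made-up counts (not the actual campaign data):

```python
import pandas as pd

# Toy frame: education level vs. loan outcome (illustrative values only)
toy = pd.DataFrame({
    "Education": [1, 1, 1, 1, 2, 2, 3, 3],
    "Personal_Loan": [0, 0, 0, 1, 0, 1, 1, 1],
})

# normalize="index" turns each crosstab row into proportions, so
# column 1 is directly the conversion rate for each education level
rates = pd.crosstab(toy["Education"], toy["Personal_Loan"], normalize="index")[1]
print(rates)  # level 1: 0.25, level 2: 0.50, level 3: 1.00
```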
histogram_boxplot(df, "Age") ## Plotting the histogram box plot for Age
print("Max age is {} and Min age is {} with median age of {}".format((df["Age"].max()),(df["Age"].min()),df["Age"].median()))
print("Mean age is {}".format(df["Age"].mean())) ## printing values
Max age is 67 and Min age is 23 with median age of 45.0 Mean age is 45.3384
histogram_boxplot(df,"Experience") ## Plotting the histogram box plot for Experience
histogram_boxplot(df,"Income") ## Plotting the histogram box plot for Income
histogram_boxplot(df,'CCAvg') ## create histogram_boxplot for CCAvg
histogram_boxplot(df,"Mortgage") ## create histogram_boxplot for Mortgage
histogram_boxplot(df,"Family") ## create histogram_boxplot for Family
labeled_barplot(df,"Education") ## Create labeled_barplot for Education
We can see from the above labeled bar plot that Education has three categories.
Undergrad (level 1) is the most common level, while Graduate (level 2) is the least common, with Advanced/Professional (level 3) in between.
labeled_barplot(df,"Securities_Account") ## create labeled_barplot for security account
labeled_barplot(df,"CD_Account") ## create labeled_barplot for CD_Account
labeled_barplot(df,"Online") ## create labeled_barplot for Online
labeled_barplot(df,"CreditCard") ## create labeled_barplot for credit card
df["ZIPCode"].nunique() ## getting number of unique values in Zipcode
467
df_zip_group = df.copy() ## making a copy to keep our main data intact
df_zip_group["ZIPCode"] = df_zip_group["ZIPCode"].astype("str") ## converting ZIPCode to string
df_zip_group["ZIPCode_group"] = df_zip_group["ZIPCode"].str[0:2] ## using first two digits of zip to group them
df_zip_group["ZIPCode_group"] = df_zip_group["ZIPCode_group"].astype("category")
df_zip_group.head() ## showing first 5 rows
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 91 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 90 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 94 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 94 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 91 |
df_zip_group["ZIPCode_group"].nunique() ## getting unique values from new zip code group
7
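The two-digit grouping above can be reduced to a one-liner; a sketch on a handful of illustrative ZIP codes:

```python
import pandas as pd

zips = pd.Series([91107, 90089, 94720, 94112, 91330])

# Cast to string, slice the first two digits, and treat the result as a category
groups = zips.astype(str).str[:2].astype("category")
print(groups.nunique())  # 91, 90, 94 -> 3 groups
```

Collapsing 467 raw ZIP codes into a few coarse regions keeps the later one-hot encoding from exploding into hundreds of sparse columns.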
plt.figure(figsize=(10,8))
ax = sns.countplot(x='ZIPCode_group', data=df_zip_group) ## Plotting a count plot
# Adding labels to the bars
for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.title('Countplot of ZIPCode Groups'); ## changing title of plot
plt.figure(figsize=(15, 7)) ## setting size of visual for proper visibility
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") ## plotting heatmap of correlations
plt.show()
stacked_barplot(df,"Family","Personal_Loan") ## plot stacked barplot for Personal Loan and Family
| Family | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 4 | 1088 | 134 | 1222 |
| 3 | 877 | 133 | 1010 |
| 1 | 1365 | 107 | 1472 |
| 2 | 1190 | 106 | 1296 |
print("For family of 4 , {} % of people purchased loan".format((134/1222)*100))
print("For family of 3 , {} % of people purchased loan".format((133/1010)*100)) ## printing calculation
print("For family of 2 , {} % of people purchased loan".format((106/1296)*100))
print("For family of 1 , {} % of people purchased loan".format((107/1472)*100))
For family of 4 , 10.965630114566286 % of people purchased loan For family of 3 , 13.16831683168317 % of people purchased loan For family of 2 , 8.179012345679013 % of people purchased loan For family of 1 , 7.2690217391304355 % of people purchased loan
stacked_barplot(df,"Securities_Account","Personal_Loan") ## plot stacked barplot for Personal Loan and Securities_Account
| Securities_Account | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 0 | 4058 | 420 | 4478 |
| 1 | 462 | 60 | 522 |
print("For People with Securities Account , {} % of people purchased loan".format((60/522)*100)) ## printing calculation
print("For People with no Securities Account , {} % of people purchased loan".format((420/4478)*100))
For People with Securities Account , 11.494252873563218 % of people purchased loan For People with no Securities Account , 9.379187137114783 % of people purchased loan
stacked_barplot(df,"CD_Account","Personal_Loan") ## plot stacked barplot for Personal Loan and CD_Account
| CD_Account | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 0 | 4358 | 340 | 4698 |
| 1 | 162 | 140 | 302 |
print("For People with CD Account , {} % of people purchased loan".format((140/302)*100)) ## printing calculation
print("For People with no CD Account , {} % of people purchased loan".format((340/4698)*100))
For People with CD Account , 46.35761589403973 % of people purchased loan For People with no CD Account , 7.237122179650915 % of people purchased loan
stacked_barplot(df,"Online","Personal_Loan") ## plot stacked barplot for Personal Loan and Online
| Online | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 1 | 2693 | 291 | 2984 |
| 0 | 1827 | 189 | 2016 |
print("For People with Online Banking , {} % of people purchased loan".format((291/2984)*100)) ## printing calculation
print("For People with no Online Banking , {} % of people purchased loan".format((189/2016)*100))
For People with Online Banking , 9.75201072386059 % of people purchased loan For People with no Online Banking , 9.375 % of people purchased loan
stacked_barplot(df,"CreditCard","Personal_Loan") ## plot stacked barplot for Personal Loan and CreditCard
| CreditCard | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 0 | 3193 | 337 | 3530 |
| 1 | 1327 | 143 | 1470 |
print("For People with Credit Card , {} % of people purchased loan".format((143/1470)*100)) ## printing calculation
print("For People with no Credit Card , {} % of people purchased loan".format((337/3530)*100))
For People with Credit Card , 9.727891156462585 % of people purchased loan For People with no Credit Card , 9.546742209631729 % of people purchased loan
distribution_plot_wrt_target(df,"Experience","Personal_Loan") ## distribution plot of Experience w.r.t Personal_Loan
distribution_plot_wrt_target(df,"Income","Personal_Loan") ## distribution plot of Income w.r.t Personal_Loan
distribution_plot_wrt_target(df,"CCAvg","Personal_Loan") ## distribution plot of CCAvg w.r.t Personal_Loan
sns.set(style="whitegrid") # Optional: Set a background style
ax = sns.countplot(x='ZIPCode_group', hue='Personal_Loan', data=df_zip_group) ## Plot count plot
# Adding labels to the bars
for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# Adding labels to the plot
plt.title('Stacked Bar Plot of ZIPCode Groups and Personal Loan')
plt.xlabel('ZIPCode Groups')
plt.ylabel('Count')
# Adding a legend with labels
plt.legend(title='Personal Loan', labels=['No Loan', 'Loan'])
print( "{} % of people from zipcode group 90 has purchased loan".format((67/(67+636))*100)) ## Printing calculations
print( "{} % of people from zipcode group 91 has purchased loan".format((55/(55+510))*100))
print( "{} % of people from zipcode group 92 has purchased loan".format((94/(94+894))*100))
print( "{} % of people from zipcode group 93 has purchased loan".format((43/(43+374))*100))
print( "{} % of people from zipcode group 94 has purchased loan".format((138/(138+1334))*100))
print( "{} % of people from zipcode group 95 has purchased loan".format((80/(80+735))*100))
print( "{} % of people from zipcode group 96 has purchased loan".format((3/(3+37))*100))
9.53058321479374 % of people from zipcode group 90 has purchased loan 9.734513274336283 % of people from zipcode group 91 has purchased loan 9.51417004048583 % of people from zipcode group 92 has purchased loan 10.311750599520384 % of people from zipcode group 93 has purchased loan 9.375 % of people from zipcode group 94 has purchased loan 9.815950920245399 % of people from zipcode group 95 has purchased loan 7.5 % of people from zipcode group 96 has purchased loan
df.head() ## checking first 5 rows
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
## Let's check whether the Experience column has erroneous entries, as the univariate analysis and the summary
## statistics suggested some discrepancies (negative values)
df[df["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3], dtype=int64)
# Correcting the negative experience values in a single pass
df["Experience"] = df["Experience"].replace({-1: 1, -2: 2, -3: 3})
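The correction can be sketched on an illustrative series; since only -1, -2, and -3 occur in the data, the mapping is equivalent to taking absolute values:

```python
import pandas as pd

# Illustrative experience values including the three bad codes
exp = pd.Series([-1, -2, -3, 10, 20])

# Map each bad code to its positive counterpart in one vectorized call
fixed = exp.replace({-1: 1, -2: 2, -3: 3})
print((fixed < 0).sum())  # -> 0, no negative values remain
```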
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
467
df["ZIPCode"] = df["ZIPCode"].astype(str) ## converting to string
print(
"Number of unique values if we take first two digits of ZIPCode: ",
df["ZIPCode"].str[0:2].nunique(), ## printing unique values
)
Number of unique values if we take first two digits of ZIPCode: 7
df["ZIPCode"] = df["ZIPCode"].str[0:2] ## Replacing the values of Zip codes with first two numbers
df["ZIPCode"] = df["ZIPCode"].astype("category") ## converting to category type
## Converting the data type of categorical features to 'category'
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
df[cat_cols] = df[cat_cols].astype("category") # convert the cat_cols to category
df.head() ## looking at first 5 rows
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
Q1 = df.quantile(0.25, numeric_only=True) # finding the 25th percentile
Q3 = df.quantile(0.75, numeric_only=True) # finding the 75th percentile
IQR = Q3 - Q1 # Interquartile Range (75th percentile - 25th percentile)
lower = Q1 - 1.5 * IQR # finding lower and upper bounds; all values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
(
    (df.select_dtypes(include=["float64", "int64"]) < lower)
    | (df.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(df) * 100 ## printing % of outliers in each column
Age                   50.06
CCAvg                 63.12
CD_Account             0.00
CreditCard             0.00
Education              0.00
Experience            57.28
Family                24.44
ID                    60.00
Income                73.82
Mortgage              30.76
Online                 0.00
Personal_Loan          0.00
Securities_Account     0.00
dtype: float64
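The IQR bound logic above can be packaged as a small reusable helper. This is a sketch on a made-up series; `outlier_pct` is a hypothetical name, not part of the notebook:

```python
import pandas as pd

def outlier_pct(series, k=1.5):
    """Percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return 100 * mask.mean()

# Illustrative series: one extreme value among small ones.
s = pd.Series([1, 2, 3, 4, 5, 100])
print(round(outlier_pct(s), 2))  # 16.67 -> only the value 100 is flagged
```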
df.drop(columns="ID", inplace=True) ## dropping the ID column
X = df.drop(columns="Personal_Loan") ## independent variables in X
y = df["Personal_Loan"] ## dependent variable in y, as it's our target variable
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True) ## one-hot encoding for ZIPCode and Education
X
| Age | Experience | Income | Family | CCAvg | Mortgage | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_91 | ZIPCode_92 | ZIPCode_93 | ZIPCode_94 | ZIPCode_95 | ZIPCode_96 | Education_2 | Education_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 29 | 3 | 40 | 1 | 1.9 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4996 | 30 | 4 | 15 | 4 | 0.4 | 85 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4997 | 63 | 39 | 24 | 2 | 0.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4998 | 65 | 40 | 49 | 3 | 0.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 28 | 4 | 83 | 3 | 0.8 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5000 rows × 18 columns
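The effect of `drop_first=True` in `pd.get_dummies` can be seen on a minimal hypothetical frame: one of the k category levels is dropped and becomes the implicit all-zeros baseline, which is why the table above has `Education_2`/`Education_3` but no `Education_1`:

```python
import pandas as pd

# Hypothetical 4-row frame with one categorical column (3 levels).
toy = pd.DataFrame({"Education": [1, 2, 3, 1]})
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)

# drop_first=True removes the first level (Education_1), leaving k-1
# indicator columns; rows with Education == 1 are all zeros.
print(dummies.columns.tolist())  # ['Education_2', 'Education_3']
```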
X_train, X_test , y_train , y_test = train_test_split(X,y,test_size=0.3,random_state=1,stratify=y) ## splitting test and train
## Lets see how our test and train set are divided
print("Shape of training set", X_train.shape)
print("Shape of test set", X_test.shape)
print("-"*100)
print("Distribution of class in training data is ")
print(y_train.value_counts(normalize=True))
print("Distribution of class in test data is ")
print(y_test.value_counts(normalize=True))
Shape of training set (3500, 18)
Shape of test set (1500, 18)
----------------------------------------------------------------------------------------------------
Distribution of class in training data is 
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
Distribution of class in test data is 
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
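As a quick illustration of why `stratify=y` matters, here is a sketch on synthetic labels with roughly the same ~10% positive rate as `Personal_Loan` (the data here is made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: ~10% positives.
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.1).astype(int)
X = np.arange(1000).reshape(-1, 1)  # dummy feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# stratify=y keeps the positive rate (nearly) identical in both splits,
# so the minority class is not accidentally over/under-represented.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```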
The model can make wrong predictions in two ways:

1. Predicting a customer will purchase the personal loan when they will not (a false positive).
2. Predicting a customer will not purchase the personal loan when they would have (a false negative).

Which case is more important?

Losing a potential loan customer (a false negative) is the costlier mistake here, since the bank's goal is to grow its loan business and earn interest on loans.

How do we reduce this loss, i.e., reduce the false negatives?

Recall should be maximized: the greater the recall, the fewer the false negatives. Hence, the focus should be on increasing recall (minimizing false negatives).

First, let's create functions to calculate the different metrics and the confusion matrix, so that we don't have to repeat the same code for each model.
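To make the recall/false-negative connection concrete, a small worked example (hypothetical labels, not the bank data):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions for 10 customers.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Recall = TP / (TP + FN): the share of actual buyers the model finds.
# Each false negative (a buyer predicted as a non-buyer) lowers recall,
# which is why maximizing recall minimizes missed potential customers.
print(tp / (tp + fn))  # 0.75: 3 buyers found, 1 missed
```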
def model_perf(model, predictors, target): ## function to compute performance metrics
    pred = model.predict(predictors) ## predicting values for the target
    accuracy = accuracy_score(target, pred) ## calculating accuracy score
    precision = precision_score(target, pred) ## calculating precision score
    recall = recall_score(target, pred) ## calculating recall score
    f1 = f1_score(target, pred) ## calculating f1 score
    df_perf = pd.DataFrame(
        {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "f1-score": f1},
        index=[0],
    ) ## collecting all scores in a dataframe
    return df_perf
## lets make a function for confusion matrix
def conf_mat (model,predictors,target): ## Initializing the function
pred=model.predict(predictors) ## predicting values for dependent using predictors
cm=confusion_matrix(target,pred) ## calculating confusion matrix
labels = np.asarray(
[
["{0:0.0f}".format(item)+"\n{0:.2%}".format(item/cm.flatten().sum())] ## labels for confusion matrix
for item in cm.flatten()
]
).reshape(2,2)
plt.figure(figsize=(6,5)) ## setting fig size for better visibility
plt.title("Confusion Matrix") ## setting title for plot
sns.heatmap(cm,annot=labels,fmt="") ## plotting heat map for confusion matrix
plt.xlabel("Predicted") ## x label
plt.ylabel("True") ## ylabel
## lets make function for visualizing tree
def show_me_tree(model, predictors): ## Initializing function
feature_names=predictors.columns.tolist() ## getting feature names
plt.figure(figsize=(20,30)) ## setting fig size for better visibility
out=tree.plot_tree(
model, ## plotting decision tree
feature_names=feature_names,
class_names=None,
node_ids=False,
filled=True,
fontsize=9,
)
for o in out: ## this will make sure that all arrows are drawn
arrow=o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
## now lets make a function to plot feature importances
def show_me_feature_imp(model,predictors): ## initializing function
feature_names=predictors.columns.tolist() ## Getting feature name
importances=model.feature_importances_ ## calculating feature importance
indices=np.argsort(importances) ## sorting the importances
fig,ax = plt.subplots(figsize=(10,8)) ## setting fig size for better visibility
ax.barh(range(len(indices)),importances[indices],color="violet",align="center") ## plotting bar graph
ax.set_yticks(range(len(indices)),[feature_names[i] for i in indices]) ## setting y ticks
ax.set_xlabel("Relative Importance") ## xlabel
ax.set_ylabel("Feature Names") ## y label
plt.show()
model = DecisionTreeClassifier(criterion="gini",random_state=1) ## initializing decision tree classifier
model.fit(X_train,y_train) ## fitting model
DecisionTreeClassifier(random_state=1)
basic_decision_tree_perf_train = model_perf(model,X_train,y_train) ## calculating model performance
conf_mat(model ,X_train,y_train) ## plotting confusion matrix
basic_decision_tree_perf_train
| Accuracy | Precision | Recall | f1-score | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
show_me_tree(model,X_train) ## plotting decision tree
print(tree.export_text(model,feature_names=X_train.columns.tolist(),show_weights=True)) ## printing the rules for tree
|--- Income <= 104.50 | |--- CCAvg <= 2.95 | | |--- weights: [2519.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 92.50 | | | |--- CD_Account <= 0.50 | | | | |--- Age <= 26.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 26.50 | | | | | |--- Income <= 81.50 | | | | | | |--- Experience <= 12.50 | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | |--- Education_2 > 0.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- Experience > 12.50 | | | | | | | |--- weights: [61.00, 0.00] class: 0 | | | | | |--- Income > 81.50 | | | | | | |--- Online <= 0.50 | | | | | | | |--- Age <= 30.00 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 30.00 | | | | | | | | |--- Experience <= 19.50 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- Experience > 19.50 | | | | | | | | | |--- CCAvg <= 3.05 | | | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 3.05 | | | | | | | | | | |--- CCAvg <= 3.70 | | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | | | |--- CCAvg > 3.70 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | |--- Online > 0.50 | | | | | | | |--- Income <= 82.50 | | | | | | | | |--- Family <= 2.00 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- Family > 2.00 | | | | | | | | | |--- Family <= 3.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Family > 3.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Income > 82.50 | | | | | | | | |--- weights: [25.00, 0.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- CCAvg <= 4.40 | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- CCAvg > 4.40 | | | | | |--- weights: [1.00, 0.00] class: 0 | | |--- Income > 92.50 | | | |--- CCAvg <= 4.45 | | | | |--- Education_3 <= 0.50 | | | | | |--- Education_2 <= 0.50 | | | | | | |--- 
Age <= 61.50 | | | | | | | |--- CCAvg <= 4.35 | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | |--- CCAvg > 4.35 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Age > 61.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Education_2 > 0.50 | | | | | | |--- Experience <= 36.50 | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | |--- Experience > 36.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | |--- Education_3 > 0.50 | | | | | |--- Family <= 2.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- Mortgage <= 74.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Mortgage > 74.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Family > 2.50 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | |--- CCAvg > 4.45 | | | | |--- Mortgage <= 320.00 | | | | | |--- Age <= 57.50 | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | |--- Age > 57.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Mortgage > 320.00 | | | | | |--- weights: [0.00, 1.00] class: 1 |--- Income > 104.50 | |--- Family <= 2.50 | | |--- Education_3 <= 0.50 | | | |--- Education_2 <= 0.50 | | | | |--- weights: [458.00, 0.00] class: 0 | | | |--- Education_2 > 0.50 | | | | |--- Income <= 116.50 | | | | | |--- CCAvg <= 2.85 | | | | | | |--- Experience <= 4.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Experience > 4.50 | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.85 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | |--- Income > 116.50 | | | | | |--- weights: [0.00, 54.00] class: 1 | | |--- Education_3 > 0.50 | | | |--- Income <= 116.50 | | | | |--- CCAvg <= 1.10 | | | | | |--- weights: [4.00, 0.00] 
class: 0 | | | | |--- CCAvg > 1.10 | | | | | |--- Age <= 33.00 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Age > 33.00 | | | | | | |--- Experience <= 22.50 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- Experience > 22.50 | | | | | | | |--- Age <= 48.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- Age > 48.50 | | | | | | | | |--- Mortgage <= 80.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | |--- Mortgage > 80.50 | | | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 67.00] class: 1 | |--- Family > 2.50 | | |--- Income <= 114.50 | | | |--- Experience <= 3.50 | | | | |--- weights: [10.00, 0.00] class: 0 | | | |--- Experience > 3.50 | | | | |--- Experience <= 31.50 | | | | | |--- Family <= 3.50 | | | | | | |--- CCAvg <= 2.90 | | | | | | | |--- Education_3 <= 0.50 | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | |--- Education_3 > 0.50 | | | | | | | | |--- Income <= 109.00 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- Income > 109.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- CCAvg > 2.90 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Family > 3.50 | | | | | | |--- ZIPCode_95 <= 0.50 | | | | | | | |--- weights: [0.00, 10.00] class: 1 | | | | | | |--- ZIPCode_95 > 0.50 | | | | | | | |--- Income <= 110.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Income > 110.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Experience > 31.50 | | | | | |--- Income <= 113.50 | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | |--- Income > 113.50 | | | | | | |--- Age <= 62.00 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Age > 
62.00 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | |--- Income > 114.50 | | | |--- weights: [0.00, 155.00] class: 1
show_me_feature_imp(model,X_train) ## plotting feature importance for above model
basic_decision_tree_perf_test = model_perf(model,X_test,y_test) ## calculating performance on test data
conf_mat(model,X_test,y_test) ## plotting confusion matrix
basic_decision_tree_perf_test
| Accuracy | Precision | Recall | f1-score | |
|---|---|---|---|---|
| 0 | 0.982667 | 0.94697 | 0.868056 | 0.905797 |
estimator = DecisionTreeClassifier(random_state=1) ## initializing decision tree classifier
parameters = {
    "max_depth": np.arange(6, 18), ## every candidate value for each parameter in a dictionary
    "min_samples_leaf": np.arange(1, 11),
    "max_leaf_nodes": np.arange(2, 11), ## max_leaf_nodes must be at least 2
}
grid_obj = GridSearchCV(estimator,parameters,cv=5,scoring=make_scorer(recall_score)) ## running grid search cv
grid_obj = grid_obj.fit(X_train,y_train) ## fitting data in grid search cv
estimator = grid_obj.best_estimator_ ## finding best estimator
estimator.fit(X_train,y_train) ## fitting model with best estimator
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=5, random_state=1)
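A minimal, self-contained sketch of the same grid-search pattern on synthetic data (the dataset and parameter values here are illustrative, not those used above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small imbalanced synthetic problem standing in for the bank data.
Xs, ys = make_classification(n_samples=300, weights=[0.9], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="recall",  # string alias, equivalent to make_scorer(recall_score)
)
grid.fit(Xs, ys)

# best_params_ records the winning combination; best_estimator_ is
# refit on the full training data with those parameters.
print(grid.best_params_)
```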
decision_tree_tune_perf_train = model_perf(estimator,X_train,y_train) ## calculating performance of pre pruned model
conf_mat(estimator,X_train,y_train) ## plotting confusion matrix
decision_tree_tune_perf_train
| Accuracy | Precision | Recall | f1-score | |
|---|---|---|---|---|
| 0 | 0.978286 | 0.871429 | 0.907738 | 0.889213 |
show_me_tree(estimator,X_train) ## plotting decision tree for pre pruned model
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=X_train.columns.tolist(), show_weights=True))
|--- Income <= 104.50
|   |--- weights: [2661.00, 31.00] class: 0
|--- Income >  104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [8.00, 61.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [9.00, 74.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [28.00, 170.00] class: 1
show_me_feature_imp(estimator,X_train) ## plotting feature importance of the pre-pruned model
decision_tree_tune_perf_test = model_perf(estimator,X_test,y_test) ## calculating performance of test data in pre pruned model
conf_mat(estimator,X_test,y_test) ## plotting confusion matrix of test data of pre pruned model
decision_tree_tune_perf_test
| Accuracy | Precision | Recall | f1-score | |
|---|---|---|---|---|
| 0 | 0.960667 | 0.777778 | 0.826389 | 0.801347 |
Total impurity of leaves vs effective alphas of pruned tree
Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
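A minimal sketch of `cost_complexity_pruning_path` on synthetic data, illustrating the monotone relationship described above (the dataset here is made up for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

Xs, ys = make_classification(n_samples=200, random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(Xs, ys)

# Alphas are returned in increasing order, and the total leaf impurity
# rises as larger alphas prune away more of the tree.
assert np.all(np.diff(path.ccp_alphas) >= -1e-12)
assert np.all(np.diff(path.impurities) >= -1e-12)
print(len(path.ccp_alphas), "pruning steps")
```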
clf = DecisionTreeClassifier(random_state=1) ## initializing decision tree classifier
path = clf.cost_complexity_pruning_path(X_train,y_train) ## getting path
ccp_alphas,impurities = path.ccp_alphas , path.impurities ## getting alpha and impurity values
pd.DataFrame(path) ## showing path as data frame
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000184 | 0.000552 |
| 2 | 0.000229 | 0.001009 |
| 3 | 0.000245 | 0.001499 |
| 4 | 0.000257 | 0.002013 |
| 5 | 0.000262 | 0.002537 |
| 6 | 0.000262 | 0.003061 |
| 7 | 0.000286 | 0.003632 |
| 8 | 0.000343 | 0.003975 |
| 9 | 0.000371 | 0.004718 |
| 10 | 0.000429 | 0.005146 |
| 11 | 0.000457 | 0.005603 |
| 12 | 0.000467 | 0.006070 |
| 13 | 0.000470 | 0.009831 |
| 14 | 0.000488 | 0.010318 |
| 15 | 0.000495 | 0.011309 |
| 16 | 0.000508 | 0.011817 |
| 17 | 0.000583 | 0.012400 |
| 18 | 0.000653 | 0.013053 |
| 19 | 0.000667 | 0.015723 |
| 20 | 0.000989 | 0.016712 |
| 21 | 0.000994 | 0.017706 |
| 22 | 0.001000 | 0.018706 |
| 23 | 0.001195 | 0.021097 |
| 24 | 0.001625 | 0.022723 |
| 25 | 0.001782 | 0.024505 |
| 26 | 0.001908 | 0.026413 |
| 27 | 0.002335 | 0.028748 |
| 28 | 0.002970 | 0.031718 |
| 29 | 0.008156 | 0.039874 |
| 30 | 0.025722 | 0.091318 |
| 31 | 0.034690 | 0.126007 |
| 32 | 0.047561 | 0.173568 |
### Let's plot Total Impurities of leaves v/s effective alpha
fig,ax = plt.subplots(figsize=(10,4)) ## increase size of plot for better visibility
ax.plot(ccp_alphas[:-1],impurities[:-1],marker="o",drawstyle="steps-post") ## plotting alpha and impurities
ax.set_title("Total Impurities of leaves v/s effective alpha") ## Setting title
ax.set_xlabel("Effective Alpha") ## setting x label
ax.set_ylabel("Total Impurities of leaves") ## setting y label
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
clfs=[] ## making empty list for classifiers
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1,ccp_alpha=ccp_alpha) ## fitting for all alphas
clf.fit(X_train,y_train)
clfs.append(clf) ## append list of classifiers
print("Number of nodes in last tree are {} with alpha of {}".format(clfs[-1].tree_.node_count,ccp_alphas[-1]))
Number of nodes in last tree are 1 with alpha of 0.04756053380018527
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs=clfs[:-1] ## here we are neglecting the last one as it will trim all the tree
ccp_alphas =ccp_alphas[:-1]
node_count = [clf.tree_.node_count for clf in clfs] ## getting node counts
max_depth = [clf.tree_.max_depth for clf in clfs] ## getting max depths
fig,ax = plt.subplots(2,1,figsize=(8,4))
ax[0].plot(ccp_alphas,node_count,marker="o",drawstyle="steps-post") ## plotting nodes with alphas
ax[0].set_xlabel("Alpha") ## x label
ax[0].set_ylabel("Node Counts") ## ylabel
ax[0].set_title("Alpha vs Node Counts") ## setting title
ax[1].plot(ccp_alphas,max_depth,marker="o",drawstyle="steps-post") ## plotting alphas with max depth
ax[1].set_xlabel("Alpha") ## x label
ax[1].set_ylabel("Max Depth") ## setting y lable
recall_train=[] ## setting up an empty list of recall values for training data
for clf in clfs:
pred_train = clf.predict(X_train) ## predicting from the predictors using clf
value_train = recall_score(y_train,pred_train) ## calculating recall score
recall_train.append(value_train) ## appending values in list
recall_test=[] ## setting up an empty list of recall values for testing data
for clf in clfs:
pred_test = clf.predict(X_test) ## predicting from the predictors using clf
value_test = recall_score(y_test,pred_test) ## calculating recall score
recall_test.append(value_test) ## appending values in list
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
# Plotting for alpha and recall for training
ax.plot(ccp_alphas, recall_train, marker="o", drawstyle="steps-post", label="Training Data")
# Plotting for alpha and recall for testing
ax.plot(ccp_alphas, recall_test, marker="o", drawstyle="steps-post", label="Testing Data")
# Setting labels for axes
ax.set_ylabel("Recall Value")
ax.set_xlabel("Effective Alpha")
# Adding a legend
ax.legend()
best_index = np.argmax(recall_test) ## finding the index of the alpha where test recall was best
best_clf = clfs[best_index] ## getting the classifier trained with that alpha
print("Hence our best alpha from cost complexity pruning comes out to be {}".format(best_clf.ccp_alpha)) ## printing result
Hence our best alpha from cost complexity pruning comes out to be 0.0006674876847290641
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=best_clf.ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
) ## initializing a decision tree with class weights and the best alpha
estimator_2.fit(X_train, y_train) ## fitting the model for post pruning
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
show_me_tree(estimator_2,X_train) ## plotting decision tree for post pruned model
### Let's see the rules for the same tree
print(tree.export_text(estimator_2,feature_names=X_train.columns.tolist(),show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [369.60, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 81.50 | | | |--- Age <= 36.50 | | | | |--- Family <= 3.50 | | | | | |--- Education_3 <= 0.50 | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | |--- Education_3 > 0.50 | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- weights: [1.65, 0.00] class: 0 | | | |--- Age > 36.50 | | | | |--- weights: [9.15, 0.00] class: 0 | | |--- Income > 81.50 | | | |--- CCAvg <= 4.40 | | | | |--- Age <= 46.00 | | | | | |--- Income <= 90.50 | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- Income > 90.50 | | | | | | |--- weights: [0.60, 1.70] class: 1 | | | | |--- Age > 46.00 | | | | | |--- Family <= 1.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- weights: [0.90, 3.40] class: 1 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | |--- Family > 1.50 | | | | | | |--- Mortgage <= 154.00 | | | | | | | |--- weights: [0.45, 7.65] class: 1 | | | | | | |--- Mortgage > 154.00 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | |--- CCAvg > 4.40 | | | | |--- weights: [2.40, 0.00] class: 0 |--- Income > 98.50 | |--- Family <= 2.50 | | |--- Education_3 <= 0.50 | | | |--- Education_2 <= 0.50 | | | | |--- Income <= 101.50 | | | | | |--- CCAvg <= 2.95 | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | |--- CCAvg > 2.95 | | | | | | |--- weights: [0.15, 2.55] class: 1 | | | | |--- Income > 101.50 | | | | | |--- weights: [71.40, 0.00] class: 0 | | | |--- Education_2 > 0.50 | | | | |--- Income <= 103.50 | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | |--- Income > 103.50 | | | | | |--- Income <= 116.50 | | | | | | |--- CCAvg <= 2.85 | | | | | | | |--- Age <= 28.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Age > 28.50 | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | |--- CCAvg > 2.85 | | | | | | | |--- weights: 
[0.00, 5.95] class: 1 | | | | | |--- Income > 116.50 | | | | | | |--- weights: [0.00, 45.90] class: 1 | | |--- Education_3 > 0.50 | | | |--- Income <= 116.50 | | | | |--- CCAvg <= 1.10 | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- CCAvg > 1.10 | | | | | |--- Age <= 34.50 | | | | | | |--- Experience <= 2.50 | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- Experience > 2.50 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | |--- Age > 34.50 | | | | | | |--- Age <= 48.50 | | | | | | | |--- CCAvg <= 3.27 | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- CCAvg > 3.27 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- Age > 48.50 | | | | | | | |--- weights: [0.15, 5.10] class: 1 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 56.95] class: 1 | |--- Family > 2.50 | | |--- Income <= 112.50 | | | |--- CCAvg <= 2.75 | | | | |--- Income <= 106.50 | | | | | |--- weights: [4.05, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 3.50 | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | |--- Experience > 3.50 | | | | | | |--- Family <= 3.50 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- Family > 3.50 | | | | | | | |--- Income <= 111.50 | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | |--- Income > 111.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- CCAvg > 2.75 | | | | |--- Age <= 59.50 | | | | | |--- weights: [0.30, 8.50] class: 1 | | | | |--- Age > 59.50 | | | | | |--- weights: [0.75, 0.00] class: 0 | | |--- Income > 112.50 | | | |--- weights: [0.90, 137.70] class: 1
post_pruning_train = model_perf(estimator_2,X_train,y_train) ## calculing performance of training data after post pruning model
conf_mat(estimator_2,X_train,y_train) ## create confusion matrix for train data
post_pruning_train
| Accuracy | Precision | Recall | f1-score | |
|---|---|---|---|---|
| 0 | 0.993429 | 0.935933 | 1.0 | 0.966906 |
show_me_feature_imp(estimator_2,X_train) ## plotting feature importance of post pruned model
## lets see performance of test data after post pruning
post_pruning_test = model_perf(estimator_2,X_test,y_test) ## calculating performance of test data of post pruned model
conf_mat(estimator_2,X_test,y_test) ## plotting confusion matrix
post_pruning_test
| Accuracy | Precision | Recall | f1-score | |
|---|---|---|---|---|
| 0 | 0.98 | 0.88 | 0.916667 | 0.897959 |
## comparison of all models
model_compare = pd.concat(
    [
        basic_decision_tree_perf_train.T,
        basic_decision_tree_perf_test.T,
        decision_tree_tune_perf_train.T,
        decision_tree_tune_perf_test.T,
        post_pruning_train.T,
        post_pruning_test.T,
    ],
    axis=1,
)
model_compare.columns = [
    "Basic Train",
    "Basic Test",
    "Pre Pruning Train",
    "Pre Pruning Test",
    "Post Pruning Train",
    "Post Pruning Test",
]
model_compare
model_compare
| Basic Train | Basic Test | Pre Pruning Train | Pre Pruning Test | Post Pruning Train | Post Pruning Test | |
|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.982667 | 0.978286 | 0.960667 | 0.993429 | 0.980000 |
| Precision | 1.0 | 0.946970 | 0.871429 | 0.777778 | 0.935933 | 0.880000 |
| Recall | 1.0 | 0.868056 | 0.907738 | 0.826389 | 1.000000 | 0.916667 |
| f1-score | 1.0 | 0.905797 | 0.889213 | 0.801347 | 0.966906 | 0.897959 |
## Hence below is our final model
best_model = estimator_2
best_model
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
show_me_tree(best_model,X_train) ## plotting the final decision tree
### Let's see the rules for the same tree
print(tree.export_text(best_model,feature_names=X_train.columns.tolist(),show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [369.60, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 81.50 | | | |--- Age <= 36.50 | | | | |--- Family <= 3.50 | | | | | |--- Education_3 <= 0.50 | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | |--- Education_3 > 0.50 | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- weights: [1.65, 0.00] class: 0 | | | |--- Age > 36.50 | | | | |--- weights: [9.15, 0.00] class: 0 | | |--- Income > 81.50 | | | |--- CCAvg <= 4.40 | | | | |--- Age <= 46.00 | | | | | |--- Income <= 90.50 | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- Income > 90.50 | | | | | | |--- weights: [0.60, 1.70] class: 1 | | | | |--- Age > 46.00 | | | | | |--- Family <= 1.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- weights: [0.90, 3.40] class: 1 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | |--- Family > 1.50 | | | | | | |--- Mortgage <= 154.00 | | | | | | | |--- weights: [0.45, 7.65] class: 1 | | | | | | |--- Mortgage > 154.00 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | |--- CCAvg > 4.40 | | | | |--- weights: [2.40, 0.00] class: 0 |--- Income > 98.50 | |--- Family <= 2.50 | | |--- Education_3 <= 0.50 | | | |--- Education_2 <= 0.50 | | | | |--- Income <= 101.50 | | | | | |--- CCAvg <= 2.95 | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | |--- CCAvg > 2.95 | | | | | | |--- weights: [0.15, 2.55] class: 1 | | | | |--- Income > 101.50 | | | | | |--- weights: [71.40, 0.00] class: 0 | | | |--- Education_2 > 0.50 | | | | |--- Income <= 103.50 | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | |--- Income > 103.50 | | | | | |--- Income <= 116.50 | | | | | | |--- CCAvg <= 2.85 | | | | | | | |--- Age <= 28.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Age > 28.50 | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | |--- CCAvg > 2.85 | | | | | | | |--- weights: 
[0.00, 5.95] class: 1 | | | | | |--- Income > 116.50 | | | | | | |--- weights: [0.00, 45.90] class: 1 | | |--- Education_3 > 0.50 | | | |--- Income <= 116.50 | | | | |--- CCAvg <= 1.10 | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- CCAvg > 1.10 | | | | | |--- Age <= 34.50 | | | | | | |--- Experience <= 2.50 | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- Experience > 2.50 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | |--- Age > 34.50 | | | | | | |--- Age <= 48.50 | | | | | | | |--- CCAvg <= 3.27 | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- CCAvg > 3.27 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- Age > 48.50 | | | | | | | |--- weights: [0.15, 5.10] class: 1 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 56.95] class: 1 | |--- Family > 2.50 | | |--- Income <= 112.50 | | | |--- CCAvg <= 2.75 | | | | |--- Income <= 106.50 | | | | | |--- weights: [4.05, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 3.50 | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | |--- Experience > 3.50 | | | | | | |--- Family <= 3.50 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- Family > 3.50 | | | | | | | |--- Income <= 111.50 | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | |--- Income > 111.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- CCAvg > 2.75 | | | | |--- Age <= 59.50 | | | | | |--- weights: [0.30, 8.50] class: 1 | | | | |--- Age > 59.50 | | | | | |--- weights: [0.75, 0.00] class: 0 | | |--- Income > 112.50 | | | |--- weights: [0.90, 137.70] class: 1
## lets see what features are most important for this model
show_me_feature_imp(best_model,X_train)